[None][chore] Fix lock_infra_error by yufeiwu-nv · Pull Request #15213 · NVIDIA/TensorRT-LLM

yufeiwu-nv · 2026-06-10T08:55:03Z

Introduced a new helper function to identify lock-infrastructure errors, improving the robustness of the config_file_lock context manager. This change allows for better handling of temporary directory fallbacks during lock acquisition failures, ensuring that exceptions are properly propagated and logged.

Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>

@coderabbitai summary

Description

Problem

config_file_lock() re-raises filelock.Timeout instead of using its tempdir fallback. The errno-narrowing added in #11960 — isinstance(e, OSError) and e.errno not in {EACCES, EPERM, ENOLCK, ESTALE} — was meant to let non-lock OSErrors propagate. But filelock.Timeout is an OSError subclass with errno=None, so it satisfies that condition and gets re-raised, defeating the very lock-acquisition-timeout fallback the function is supposed to provide.
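The interaction described above can be reproduced with the standard library alone. The sketch below is illustrative, not the code from `model_config.py`: `Timeout` is a local stand-in for `filelock.Timeout` (assumed here to derive from the built-in `TimeoutError` and to be constructed without an errno), and `old_should_reraise` is a hypothetical name for the narrowing condition quoted from #11960.

```python
import errno


class Timeout(TimeoutError):
    """Local stand-in for filelock.Timeout (assumption: no errno is ever set)."""

    def __init__(self, lock_file):
        super().__init__()
        self.lock_file = lock_file


# Errnos the narrowing condition treats as fallback-eligible.
FALLBACK_ERRNOS = {errno.EACCES, errno.EPERM, errno.ENOLCK, errno.ESTALE}


def old_should_reraise(exc):
    """Hypothetical name for the errno-narrowing condition quoted above."""
    return isinstance(exc, OSError) and exc.errno not in FALLBACK_ERRNOS


timeout = Timeout("/tmp/_remote_code.lock")
print(issubclass(Timeout, OSError))   # True: TimeoutError derives from OSError
print(timeout.errno)                  # None
print(old_should_reraise(timeout))    # True: the timeout is re-raised, no fallback
print(old_should_reraise(PermissionError(errno.EACCES, "denied")))  # False
```

Because `None` is never a member of the errno set, any errno-less `OSError` subclass takes the re-raise path, which is why the timeout never reaches the tempdir fallback.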

Impact

When multiple ranks load a trust_remote_code model concurrently (tp/ep > 1), they contend on the single global _remote_code.lock. The ranks that time out crash during executor init and trigger MPI_ABORT — observed on the deepseek_r1_0528_fp4 ... ep:4-tp:4 perf test.

Fix

Refactor config_file_lock into a single-yield context manager that guards only the acquire() call. The yield moves into the else branch, with the release in a finally block, so exceptions raised by the caller body (e.g. HF RepositoryNotFoundError, also an OSError subclass) propagate cleanly without a second yield. Fallback-eligible failures are now selected via isinstance: filelock.Timeout and PermissionError are matched explicitly, along with the NFS errnos ENOLCK and ESTALE.
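A minimal sketch of this shape, under stated assumptions, might look as follows. It is not the merged implementation: `Timeout` stands in for `filelock.Timeout`, `FakeLock` is a hypothetical lock exposing only `acquire()`/`release()`, and the names `is_lock_infra_error`, `try_acquire` and `config_file_lock_sketch` are illustrative. The control flow is simplified to a loop over the primary and fallback locks rather than the else/finally layout named above, but it keeps the two properties that matter: only `acquire()` is guarded, and there is exactly one `yield`.

```python
import contextlib
import errno


class Timeout(TimeoutError):
    """Stand-in for filelock.Timeout (assumption for this sketch)."""


class FakeLock:
    """Hypothetical lock with acquire()/release(); keeps the sketch stdlib-only."""

    def __init__(self, fail_with=None):
        self.fail_with = fail_with
        self.held = False

    def acquire(self):
        if self.fail_with is not None:
            raise self.fail_with
        self.held = True

    def release(self):
        self.held = False


def is_lock_infra_error(exc):
    """Select fallback-eligible failures by type first, then by NFS errno."""
    if isinstance(exc, (Timeout, PermissionError)):
        return True
    return isinstance(exc, OSError) and exc.errno in (errno.ENOLCK, errno.ESTALE)


def try_acquire(lock):
    """Guard only acquire(); anything that is not a lock-infra error propagates."""
    try:
        lock.acquire()
    except Exception as exc:
        if not is_lock_infra_error(exc):
            raise
        return False
    return True


@contextlib.contextmanager
def config_file_lock_sketch(primary, fallback):
    held = None
    for lock in (primary, fallback):
        if try_acquire(lock):
            held = lock
            break
    try:
        yield held  # the only yield; errors from the caller body are not caught
    finally:
        if held is not None:
            held.release()


primary, fallback = FakeLock(fail_with=Timeout()), FakeLock()
with config_file_lock_sketch(primary, fallback) as held:
    print(held is fallback)  # True: the timeout triggered the fallback lock
```

Since nothing wraps the `yield` in an `except`, an `OSError` raised inside the `with` body is never mistaken for a lock failure. If both acquisitions fail for infrastructure reasons, the sketch yields `None` and the body runs unlocked.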

Verification

Reproduced with a real-filelock multi-process contention test (1 holder + 3 workers): the shipped logic makes all waiting workers raise Timeout, while the fix lets them all fall back and succeed.
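The test described above used real `filelock` locks across processes. As a rough, stdlib-only approximation of the same scenario, the sketch below swaps in `threading.Lock` and threads (a different mechanism, named plainly as such). `Timeout`, `worker` and `run` are hypothetical names, and the timeouts are arbitrary values chosen for the illustration.

```python
import threading


class Timeout(TimeoutError):
    """Stand-in for filelock.Timeout (assumption for this sketch)."""


def acquire_or_timeout(lock, timeout):
    # threading.Lock.acquire returns False when the timeout expires.
    if not lock.acquire(timeout=timeout):
        raise Timeout()


def worker(primary, fallback, use_fallback, results, idx):
    try:
        try:
            acquire_or_timeout(primary, timeout=0.05)
            held = primary
        except Timeout:
            if not use_fallback:
                raise  # old behaviour: the timeout escapes to the caller
            acquire_or_timeout(fallback, timeout=5.0)
            held = fallback
        try:
            results[idx] = "ok"  # stands in for loading the model config
        finally:
            held.release()
    except Timeout:
        results[idx] = "timeout"


def run(use_fallback, n_workers=3):
    primary, fallback = threading.Lock(), threading.Lock()
    primary.acquire()  # the holder keeps the primary lock for the whole run
    results = [None] * n_workers
    threads = [
        threading.Thread(target=worker,
                         args=(primary, fallback, use_fallback, results, i))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    primary.release()
    return results


print(run(use_fallback=False))  # ['timeout', 'timeout', 'timeout']
print(run(use_fallback=True))   # ['ok', 'ok', 'ok']
```

With one holder and three workers, every worker times out on the primary lock. Only the variant that treats the timeout as fallback-eligible lets them all finish.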

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…formance tests and configurations - Updated model paths to include nemotron_3_ultra_550b_nvfp4 in HF_MODEL_PATH. - Added configuration settings for nemotron_3_ultra_550b_nvfp4 in pytorch_model_config.py. - Included new performance test cases for nemotron_3_ultra_550b_nvfp4 in test_perf.py and updated llm_perf_core.yml. - Cleaned up legacy model name handling in test_perf.py. Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>

…ts for nemotron and llama models - Reintroduced performance tests for nemotron_nano_12b_v2 and qwen3.5_27b models with various configurations. - Added performance tests for llama_v3.3_nemotron_super_49b with multiple input/output lengths and GPU configurations. - Ensured comprehensive coverage of performance benchmarks in the llm_perf_core.yml file. Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>

- Removed redundant test cases for llama_v3.1_nemotron_ultra_253b and adjusted the configuration for qwen3.5_122b_a10b. - Added back performance tests for llama_v3.1_nemotron_ultra_253b with various input/output lengths and GPU configurations. - Updated comments for clarity on the test cases included. Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>

Addresses CodeRabbit review: --log_level=info is a static literal and does not need an f-string prefix (ruff F541). Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>

Introduced a new helper function to identify lock-infrastructure errors, improving the robustness of the config_file_lock context manager. This change allows for better handling of temporary directory fallbacks during lock acquisition failures, ensuring that exceptions are properly propagated and logged. Signed-off-by: [Your Name] <your.email@example.com> Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>

coderabbitai · 2026-06-10T08:56:46Z

📝 Walkthrough


Refactors the HuggingFace cache lock acquisition to classify infrastructure errors and implement fallback logic. Adds a helper function to identify timeout and permission-related lock failures, then updates the context manager to acquire locks explicitly and conditionally retry with a tempdir-based lock when primary acquisition fails for infrastructure reasons.

Changes

Lock infrastructure improvement

| Layer / File(s) | Summary |
| --- | --- |
| Lock error classification and fallback locking `tensorrt_llm/_torch/model_config.py` | Introduces `_is_lock_infra_error()` to classify timeout, permission, and specific `OSError` failures as fallback-eligible. Refactors `config_file_lock` to use explicit `acquire()`/`release()` calls, conditionally acquire a tempdir lock on infrastructure errors, and proceed without a lock if both primary and fallback acquisitions fail due to infrastructure issues. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title '[None][chore] Fix lock_infra_error' follows the template format and clearly describes the main change (adding infrastructure for handling lock errors), though 'fix' may be more descriptive than 'chore'. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description comprehensively explains the problem, impact, fix, and verification with clear sections following the template structure. |


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/model_config.py`:
- Around line 72-73: The helper _is_lock_infra_error currently treats
filelock.Timeout as an infrastructure failure; update that function to stop
classifying filelock.Timeout as an infra error (it is an acquisition timeout,
not broken lock infra). Locate _is_lock_infra_error and remove filelock.Timeout
from the isinstance check so only true infrastructure errors (e.g.,
PermissionError or other genuine file-locking exceptions you want to keep)
trigger the tempdir/no-lock fallback paths; ensure PermissionError handling
remains unchanged.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7b0c3216-d3ec-4a6d-a146-9bb344730ee7

📥 Commits

Reviewing files that changed from the base of the PR and between 90cb7ff and 2784114.

📒 Files selected for processing (1)

tensorrt_llm/_torch/model_config.py

Updated the logic in the _is_lock_infra_error function to better differentiate between lock contention and broken infrastructure. Enhanced the config_file_lock context manager to log warnings appropriately when lock acquisition fails, ensuring clearer error handling and fallback behavior. Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>

yufeiwu-nv · 2026-06-11T05:57:59Z

/bot run

tensorrt-cicd · 2026-06-11T06:03:30Z

PR_Github #53491 [ run ] triggered by Bot. Commit: 9e65d1e Link to invocation

yufeiwu-nv · 2026-06-11T07:02:22Z

/bot help

github-actions · 2026-06-11T07:02:32Z

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Supports wildcard * for pattern matching (e.g., "*PerfSanity*" matches all stages containing PerfSanity). Examples: "A10-PyTorch-1, xxx", "PerfSanity". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Supports wildcard * for pattern matching. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx", --extra-stage "Post-Merge".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

--high-priority (OPTIONAL) : Run the pipeline with high priority. This option is restricted to authorized users only and will route the job to a high-priority queue.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

tensorrt-cicd · 2026-06-11T10:37:08Z

PR_Github #53491 [ run ] completed with state SUCCESS. Commit: 9e65d1e
/LLM/main/L0_MergeRequest_PR pipeline #42654 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

yufeiwu-nv · 2026-06-12T06:25:05Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-12T06:30:42Z

PR_Github #53818 [ run ] triggered by Bot. Commit: 28c798e Link to invocation

tensorrt-cicd · 2026-06-12T10:58:03Z

PR_Github #53818 [ run ] completed with state SUCCESS. Commit: 28c798e
/LLM/main/L0_MergeRequest_PR pipeline #42935 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

yufeiwu-nv added 6 commits June 9, 2026 12:20

Merge branch 'main' into bug

ea9ef23

Merge branch 'main' into bug

b84cfcd

[None][test] Remove unnecessary f-string prefix in build command

5c94d9b

Addresses CodeRabbit review: --log_level=info is a static literal and does not need an f-string prefix (ruff F541). Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>

yufeiwu-nv requested review from a team as code owners June 10, 2026 08:55

yufeiwu-nv requested review from dongxuy04 and yizhang-nv June 10, 2026 08:55

github-actions Bot assigned yufeiwu-nv Jun 10, 2026

yufeiwu-nv mentioned this pull request Jun 10, 2026

[https://nvbugs/6115560][fix] catch OSError in config_file_lock for NFS compatibility #11960

Merged

yufeiwu-nv added 2 commits June 10, 2026 08:56

Merge branch 'main' into bug

2784114

yufeiwu-nv removed request for a team, dongxuy04 and yizhang-nv June 10, 2026 08:57

coderabbitai Bot reviewed Jun 10, 2026


Comment thread tensorrt_llm/_torch/model_config.py Outdated

yufeiwu-nv added 2 commits June 10, 2026 09:25

Merge branch 'main' into bug

9e65d1e

Merge branch 'main' into bug

28c798e

yufeiwu-nv enabled auto-merge (squash) June 12, 2026 06:46

yufeiwu-nv added 2 commits June 12, 2026 15:13

Merge branch 'main' into bug

6b0d18f

Merge branch 'main' into bug

a16f6e0
